Abstract

We reduce the memory footprint of popular 
large-scale online learning methods by pro- 
jecting our weight vector onto a coarse dis- 
crete set using randomized rounding. Com- 
pared to standard 32-bit float encodings, this 
reduces RAM usage by more than 50% during 
training and by up to 95% when making pre- 
dictions from a fixed model, with almost no 
loss in accuracy. We also show that random- 
ized counting can be used to implement per- 
coordinate learning rates, improving model 
quality with little additional RAM. We prove 
these memory-saving methods achieve regret 
guarantees similar to their exact variants. 
Empirical evaluation confirms excellent per- 
formance, dominating standard approaches 
across memory versus accuracy tradeoffs. 



1. Introduction 

As the growth of machine learning data sets contin- 
ues to accelerate, available machine memory (RAM) 
is an increasingly important constraint. This is true 
for training massive-scale distributed learning systems, 
such as those used for predicting ad click through rates 
(CTR) for sponsored search (Richardson et al., 2007; 
Craswell et al., 2008; Bilenko & Richardson, 2011; 
Streeter & McMahan, 2010) or for filtering email spam
at scale (Goodman et al., 2007). Minimizing RAM use
is also important on a single machine if we wish to utilize
the limited memory of a fast GPU processor, or to
simply use fast L1-cache more effectively. After training,
memory cost remains a key consideration at prediction
time as real-world models are often replicated
to multiple machines to minimize prediction latency.

This is an extended version of the paper of the same name
which appeared in ICML 2013. The main addition is Appendix
A.3, which contains additional proofs.



Efficient learning at peta-scale is commonly achieved 
by online gradient descent (OGD) (Zinkevich, 2003) 
or stochastic gradient descent (SGD), (e.g., Bottou & 
Bousquet, 2008), in which many tiny steps are accumulated
in a weight vector $\beta \in \mathbb{R}^d$. For large-scale
learning, storing $\beta$ can consume considerable RAM,
especially when datasets far exceed memory capacity 
and examples are streamed from network or disk. 

Our goal is to reduce the memory needed to store $\beta$.
Standard implementations store coefficients in single 
precision floating-point representation, using 32 bits 
per value. This provides fine-grained precision needed 
to accumulate these tiny steps with minimal roundoff 
error, but has a dynamic range that far exceeds the 
needs of practical machine learning (see Figure 1). 

We use coefficient representations that have more limited
precision and dynamic range, allowing values to
be stored cheaply. This coarse grid does not provide 
enough resolution to accumulate gradient steps with- 
out error, as the grid spacing may be larger than the 
updates. But we can obtain a provable safety guaran- 
tee through a suitable OGD algorithm that uses ran- 
domized rounding to project its coefficients onto the 
grid each round. The precision of the grid used on each 
round may be fixed in advance or changed adaptively 
as learning progresses. At prediction time, more ag- 
gressive rounding is possible because errors no longer 
accumulate. 

Online learning on large feature spaces where some 
features occur very frequently and others are rare often 
benefits from per-coordinate learning rates, but this 
requires an additional 32-bit count to be stored for 
each coordinate. In the spirit of randomized rounding, 
we limit the memory footprint of this strategy by using 
an 8-bit randomized counter for each coordinate based 
on a variant of Morris's algorithm (1978). We show the 
resulting regret bounds are only slightly worse than the 
exact counting variant (Theorem 3.3), and empirical 
results show negligible added loss.

Figure 1. Histogram of coefficients in a typical large-scale
linear model trained from real data. Values are tightly
grouped near zero; a large dynamic range is superfluous.

Contributions This paper gives the following theo- 
retical and empirical results: 

1. Using a pre-determined fixed-point representation 
of coefficient values reduces cost from 32 to 16 bits 
per value, at the cost of a small linear regret term. 

2. The cost of a per-coordinate learning rate sched- 
ule can be reduced from 32 to 8 bits per coordinate 
using a randomized counting scheme. 

3. Using an adaptive per-coordinate coarse representation
of coefficient values reduces memory cost
further and yields a no-regret algorithm. 

4. Variable-width encoding at prediction time allows 
coefficients to be encoded even more compactly 
(less than 2 bits per value in experiments) with 
negligible added loss. 

Approaches 1 and 2 are particularly attractive, as they 
require only small code changes and use negligible ad- 
ditional CPU time. Approaches 3 and 4 require more 
sophisticated data structures. 

2. Related Work 

In addition to the sources already referenced, related 
work has been done in several areas. 

Smaller Models A classic approach to reducing 
memory usage is to encourage sparsity, for example via 
the Lasso (Tibshirani, 1996) variant of least-squares 
regression, and the more general application of $L_1$ regularizers
(Duchi et al., 2008; Langford et al., 2009;
Xiao, 2009; McMahan, 2011). A more recent trend 
has been to reduce memory cost via the use of feature 
hashing (Weinberger et al., 2009). Both families of ap- 
proaches are effective. The coarse encoding schemes 
reported here may be used in conjunction with these 
methods to give further reductions in memory usage. 

Randomized Rounding Randomized rounding
schemes have been widely used in numerical computing
and algorithm design (Raghavan & Tompson,
1987). Recently, the related technique of randomized
counting has enabled compact language models
(Van Durme & Lall, 2009). To our knowledge, this
paper gives the first algorithms and analysis for online
learning with randomized rounding and counting.

Per-Coordinate Learning Rates Duchi et al. 
(2010) and McMahan & Streeter (2010) demon- 
strated that per-coordinate adaptive regularization 
(i.e., adaptive learning rates) can greatly boost pre- 
diction accuracy. The intuition is to let the learning 
rate for common features decrease quickly, while keep- 
ing the learning rate high for rare features. This adap- 
tivity increases RAM cost by requiring an additional 
statistic to be stored for each coordinate, most often 
as an additional 32-bit integer. Our approach reduces 
this cost by using an 8-bit randomized counter instead, 
using a variant of Morris's algorithm (Morris, 1978). 

3. Learning with Randomized Rounding and Probabilistic Counting

For concreteness, we focus on logistic regression with
binary feature vectors $x \in \{0,1\}^d$ and labels $y \in \{0,1\}$.
The model has coefficients $\beta \in \mathbb{R}^d$, and gives
predictions $p_\beta(x) = \sigma(\beta \cdot x)$, where $\sigma(z) = 1/(1 + e^{-z})$
is the logistic function. Logistic regression finds the
model that minimizes the logistic-loss $\mathcal{L}$. Given a labeled
example $(x, y)$ the logistic-loss is

$$\mathcal{L}(x, y; \beta) = -y \log(p_\beta(x)) - (1 - y) \log(1 - p_\beta(x))$$

where we take $0 \log 0 = 0$. Here, we take log to be
the natural logarithm. We define $\|x\|_p$ to be the $\ell_p$ norm
of a vector $x$; when the subscript $p$ is omitted, the
$\ell_2$ norm is implied. We use the compressed summation
notation $g_{1:t} \equiv \sum_{s=1}^{t} g_s$ for scalars, and similarly
$f_{1:t}(x) \equiv \sum_{s=1}^{t} f_s(x)$ for functions.

The basic algorithm we propose and analyze is a variant
of online gradient descent (OGD) that stores coefficients
$\beta$ in a limited precision format using a discrete
set $(\epsilon\mathbb{Z})^d$. For each OGD update, we compute each
new coefficient value in 64-bit floating point represen- 
tation and then use randomized rounding to project 
the updated value back to the coarser representation. 

A useful representation for the discrete set $(\epsilon\mathbb{Z})^d$ is
the Qn.m fixed-point representation. This uses n bits
for the integral part of the value, and m bits for the
fractional part. Adding in a sign bit results in a total
of $K = n + m + 1$ bits per value. The value m may be
fixed in advance, or set adaptively as described below.
We use the method RandomRound from Algorithm 1
to project values onto this encoding.
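As a rough illustration of the Qn.m encoding just described, the following sketch stores a coefficient as a signed integer count of steps of size $\epsilon = 2^{-m}$, choosing between the two neighbouring grid points at random so that the decoded value is unbiased. The function names `encode_q` and `decode_q` and the clipping to the representable range are assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def encode_q(beta, n, m, rng=random):
    """Encode beta as a signed integer number of eps = 2**-m steps.

    Hypothetical helper: rounds up with probability equal to the
    fractional part, so decode_q(encode_q(beta)) equals beta in expectation
    (as long as beta lies inside the representable range).
    """
    scaled = beta * (2 ** m)            # beta measured in units of eps
    lo = math.floor(scaled)             # grid index at or below beta
    q = lo + (1 if rng.random() < scaled - lo else 0)
    limit = 2 ** (n + m) - 1            # largest magnitude with n + m bits
    return max(-limit, min(limit, q))   # clipping is an assumption of this sketch


def decode_q(q, m):
    """Recover the float value with one integer-float multiplication."""
    return q * 2.0 ** -m
```

With n = 2 and m = 13 the integer fits in 16 bits including the sign, which matches the q2.13 encoding used in the experiments.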

Algorithm 1 OGD-Rand-1d

input: feasible set $\mathcal{F} = [-R, R]$, learning rate schedule $\eta_t$, resolution schedule $\epsilon_t$
define fun Project$(\beta) = \max(-R, \min(\beta, R))$
Initialize $\hat\beta_1 = 0$
for t = 1, ..., T do
    Play the point $\hat\beta_t$, observe $g_t$
    $\beta_{t+1} = $ Project$(\hat\beta_t - \eta_t g_t)$
    $\hat\beta_{t+1} \leftarrow$ RandomRound$(\beta_{t+1}, \epsilon_t)$

function RandomRound$(\beta, \epsilon)$
    $a \leftarrow \epsilon \lfloor \beta/\epsilon \rfloor$; $b \leftarrow \epsilon \lceil \beta/\epsilon \rceil$
    return $b$ with prob. $(\beta - a)/\epsilon$, and $a$ otherwise

The added CPU cost of fixed-point encoding and randomized
rounding is low. Typically K is chosen to
correspond to a machine integer (say K = 8 or 16),
so converting back to a floating point representation
requires a single integer-float multiplication (by
$\epsilon = 2^{-m}$). Randomized rounding requires a call to
a pseudo-random number generator, which may be
done in 18-20 flops. Overall, the added CPU overhead
is negligible, especially as many large-scale learning
methods are I/O bound reading from disk or network
rather than CPU bound.
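Algorithm 1 can be sketched in a few lines. The version below is only an illustration under stated assumptions: it uses the schedules $\eta_t = \alpha/\sqrt{t}$ and $\epsilon_t = \gamma\eta_t$, takes a caller-supplied `gradient_fn` (a hypothetical callback), and clips once more after rounding because this sketch does not force $-R$ and $+R$ to be grid points.

```python
import math
import random


def project(beta, R):
    """Clip beta to the feasible interval [-R, R]."""
    return max(-R, min(R, beta))


def random_round(beta, eps, rng):
    """Unbiased rounding of beta to a multiple of eps."""
    a = eps * math.floor(beta / eps)   # grid point at or below beta
    b = eps * math.ceil(beta / eps)    # grid point at or above beta
    if a == b:
        return a
    return b if rng.random() < (beta - a) / eps else a


def ogd_rand_1d(gradient_fn, T, R, alpha, gamma, seed=0):
    """One-dimensional OGD with randomized rounding; returns the points played."""
    rng = random.Random(seed)
    beta_hat = 0.0
    played = []
    for t in range(1, T + 1):
        played.append(beta_hat)              # play the rounded point
        g = gradient_fn(beta_hat, t)         # observe the gradient
        eta = alpha / math.sqrt(t)           # assumed learning-rate schedule
        eps = gamma * eta                    # grid resolution tied to eta
        beta = project(beta_hat - eta * g, R)
        # Extra clip: an assumption of this sketch, see the lead-in.
        beta_hat = project(random_round(beta, eps, rng), R)
    return played
```

For example, `ogd_rand_1d(lambda b, t: b - 0.3, T=2000, R=1.0, alpha=1.0, gamma=1.0)` runs the sketch on the quadratic loss $(\beta - 0.3)^2/2$, whose minimizer is 0.3.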

3.1. Regret Bounds for Randomized Rounding 

We now prove theoretical guarantees (in the form of 
upper bounds on regret) for a variant of OGD that 
uses randomized rounding on an adaptive grid as well 
as per-coordinate learning rates. (These bounds can 
also be applied to a fixed grid). We use the standard 
definition 

$$\mathrm{Regret} \equiv \sum_{t=1}^{T} f_t(\hat\beta_t) - \min_{\beta^* \in \mathcal{F}} \sum_{t=1}^{T} f_t(\beta^*)$$

given a sequence of convex loss functions $f_t$. Here the
$\hat\beta_t$ our algorithm plays are random variables, and since
we allow the adversary to adapt based on the previously
observed $\hat\beta_t$, the $f_t$ and post-hoc optimal $\beta^*$
are also random variables. We prove bounds on expected
regret, where the expectation is with respect
to the randomization used by our algorithms (high-probability
bounds are also possible). We consider
regret with respect to the best model in the non-discretized
comparison class $\mathcal{F} = [-R, R]^d$.

We follow the usual reduction from convex to linear functions introduced by Zinkevich (2003); see also Shalev-Shwartz (2012, Sec. 2.4). Further, since we consider the hyper-rectangle feasible set $\mathcal{F} = [-R, R]^d$, the linear problem decomposes into d independent one-dimensional problems.¹ In this setting, we consider OGD with randomized rounding to an adaptive grid of resolution $\epsilon_t$ on round t, and an adaptive learning rate $\eta_t$. We then run one copy of this algorithm for each coordinate of the original convex problem, implying that we can choose the $\eta_t$ and $\epsilon_t$ schedules appropriately for each coordinate. For simplicity, we assume the $\epsilon_t$ resolutions are chosen so that $-R$ and $+R$ are always gridpoints. Algorithm 1 gives the one-dimensional version, which is run independently on each coordinate (with a different learning rate and discretization schedule) in Algorithm 2. The core result is a regret bound for Algorithm 1 (omitted proofs can be found in the Appendix):

¹Extension to arbitrary feasible sets is possible, but choosing the hyper-rectangle simplifies the analysis; in practice, projection onto the feasible set rarely helps performance.

Theorem 3.1. Consider running Algorithm 1 with adaptive non-increasing learning-rate schedule $\eta_t$, and discretization schedule $\epsilon_t$ such that $\epsilon_t \leq \gamma\eta_t$ for a constant $\gamma > 0$. Then, against any sequence of gradients $g_1, \ldots, g_T$ (possibly selected by an adaptive adversary) with $|g_t| \leq G$, against any comparator point $\beta^* \in [-R, R]$, we have

$$\mathbb{E}[\mathrm{Regret}(\beta^*)] \leq \frac{(2R)^2}{2\eta_T} + \frac{1}{2}(G^2 + \gamma^2)\,\eta_{1:T} + \gamma R \sqrt{T}.$$

By choosing $\gamma$ sufficiently small, we obtain an expected regret bound that is indistinguishable from the non-rounded version (which is obtained by taking $\gamma = 0$). In practice, we find simply choosing $\gamma = 1$ yields excellent results. With some care in the choice of norms used, it is straightforward to extend the above result to d dimensions. Applying the above algorithm on a per-coordinate basis yields the following guarantee:

Corollary 3.2. Consider running Algorithm 2 on the feasible set $\mathcal{F} = [-R, R]^d$, which in turn runs Algorithm 1 on each coordinate. We use per-coordinate learning rates $\eta_{t,i} = \alpha/\sqrt{\tau_{t,i}}$ with $\alpha = \sqrt{2}R/\sqrt{G^2 + \gamma^2}$, where $\tau_{t,i} \leq t$ is the number of non-zero $g_{s,i}$ seen on coordinate i on rounds $s = 1, \ldots, t$. Then, against convex loss functions $f_t$, with $g_t$ a subgradient of $f_t$ at $\hat\beta_t$, such that $\forall t, \|g_t\|_\infty \leq G$, we have

$$\mathbb{E}[\mathrm{Regret}] \leq \sum_{i=1}^{d} \left( 2R\sqrt{2\tau_{T,i}(G^2 + \gamma^2)} + \gamma R \sqrt{\tau_{T,i}} \right).$$

The proof follows by summing the bound from Theorem 3.1 over each coordinate, considering only the rounds when $g_{t,i} \neq 0$, and then using the inequality $\sum_{t=1}^{T} 1/\sqrt{t} \leq 2\sqrt{T}$ to handle the sum of learning rates on each coordinate.

Algorithm 2 OGD-Rand

input: feasible set $\mathcal{F} = [-R, R]^d$, parameters $\alpha, \gamma > 0$
Initialize $\hat\beta_1 = 0 \in \mathbb{R}^d$; $\forall i, \tau_i = 0$
for t = 1, ..., T do
    Play the point $\hat\beta_t$, observe loss function $f_t$
    for i = 1, ..., d do
        let $g_{t,i} = \nabla f_t(\hat\beta_t)_i$
        if $g_{t,i} = 0$ then continue
        $\tau_i \leftarrow \tau_i + 1$
        let $\eta_{t,i} = \alpha/\sqrt{\tau_i}$ and $\epsilon_{t,i} = \gamma\eta_{t,i}$
        $\beta_{t+1,i} \leftarrow$ Project$(\hat\beta_{t,i} - \eta_{t,i} g_{t,i})$
        $\hat\beta_{t+1,i} \leftarrow$ RandomRound$(\beta_{t+1,i}, \epsilon_{t,i})$

The core intuition behind this algorithm is that for features where we have little data (that is, $\tau_i$ is small, for example rare words in a bag-of-words representation, identified by a binary feature), using a fine-precision coefficient is unnecessary, as we can't estimate the correct coefficient with much confidence. This is in fact the same reason using a larger learning rate is appropriate, so it is no coincidence the theory suggests choosing $\epsilon_t$ and $\eta_t$ to be of the same magnitude.
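A minimal sketch of the per-coordinate scheme of Algorithm 2, specialised to logistic regression on sparse binary features, might look as follows. The class name `OGDRand`, the dictionary-based storage, the default parameter values, and the final clip after rounding are all assumptions of this example rather than details from the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OGDRand:
    """Hypothetical per-coordinate OGD with randomized rounding."""

    def __init__(self, R=4.0, alpha=0.5, gamma=1.0, seed=0):
        self.R, self.alpha, self.gamma = R, alpha, gamma
        self.rng = random.Random(seed)
        self.beta = {}   # coordinate -> rounded coefficient
        self.tau = {}    # coordinate -> number of non-zero gradients seen

    def predict(self, active):
        """active: indices i with x_i = 1."""
        return sigmoid(sum(self.beta.get(i, 0.0) for i in active))

    def _random_round(self, value, eps):
        a = eps * math.floor(value / eps)
        if a == value:
            return a
        return a + eps if self.rng.random() < (value - a) / eps else a

    def update(self, active, y):
        p = self.predict(active)
        g = p - y                      # logistic-loss gradient for each active coordinate
        if g == 0.0:
            return p
        for i in active:
            self.tau[i] = self.tau.get(i, 0) + 1
            eta = self.alpha / math.sqrt(self.tau[i])
            eps = self.gamma * eta     # resolution of the same magnitude as eta
            stepped = max(-self.R, min(self.R, self.beta.get(i, 0.0) - eta * g))
            rounded = self._random_round(stepped, eps)
            # Clip again: this sketch does not force -R and +R onto the grid.
            self.beta[i] = max(-self.R, min(self.R, rounded))
        return p
```

Coordinates that are rarely active keep a small count, hence both a large learning rate and a coarse grid, which is the pairing the paragraph above motivates.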

Fixed Discretization Rather than implementing an adaptive discretization schedule, it is more straightforward and more efficient to choose a fixed grid resolution; for example, a 16-bit Qn.m representation is sufficient for many applications.² In this case, one can apply the above theory, but simply stop decreasing the learning rate once it reaches say $\epsilon$ ($= 2^{-m}$). Then, the $\eta_{1:T}$ term in the regret bound yields a linear term like $O(\epsilon T)$; this is unavoidable when using a fixed resolution $\epsilon$. One could let the learning rate continue to decrease like $1/\sqrt{t}$, but this would provide no benefit; in fact, lower-bounding the learning-rate is known to allow online gradient descent to provide regret bounds against a moving comparator (Zinkevich, 2003).
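The effect of flooring the learning rate at the grid resolution can be illustrated numerically. The helper names below are hypothetical, and the $\alpha/\sqrt{t}$ schedule is an assumption carried over from the per-coordinate rates used earlier.

```python
import math


def floored_learning_rate(t, alpha, eps):
    """Decay like alpha / sqrt(t), but never drop below the resolution eps."""
    return max(alpha / math.sqrt(t), eps)


def sum_learning_rates(T, alpha, eps):
    """eta_{1:T} for the floored schedule; grows at least like eps * T."""
    return sum(floored_learning_rate(t, alpha, eps) for t in range(1, T + 1))
```

Since every term is at least `eps`, the sum is at least `eps * T`, which is where the linear regret term comes from.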

Data Structures There are several viable approaches to storing models with variable-sized coefficients. One can store all keys at a fixed (low) precision, then maintain a sequence of maps (e.g., as hash-tables), each containing a mapping from keys to coefficients of increasing precision. Alternately, a simple linear probing hash-table for variable length keys is efficient for a wide variety of distributions on key lengths, as demonstrated by Thorup (2009). With this data structure, keys and coefficient values can be treated as strings over 4-bit or 8-bit bytes, for example. Blandford & Blelloch (2008) provide yet another data structure: a compact dictionary for variable length keys. Finally, for a fixed model, one can write out the string s of all coefficients (without end of string delimiters), store a second binary string of length s with ones at the coefficient boundaries, and use any of a number of rank/select data structures to index into it, e.g., the one of Patrascu (2008).

²If we scale $x \to 2x$ then we must take $\beta \to \beta/2$ to make the same predictions, and so appropriate choices of n and m must be data-dependent.

3.2. Approximate Feature Counts 

Online convex optimization methods typically use a learning rate that decreases over time, e.g., setting $\eta_t$ proportional to $1/\sqrt{t}$. Per-coordinate learning rates require storing a unique count $\tau_i$ for each coordinate, where $\tau_i$ is the number of times coordinate i has appeared with a non-zero gradient so far. Significant space is saved by using an 8-bit randomized counting scheme rather than a 32-bit (or 64-bit) integer to store the d total counts. We use a variant of Morris' probabilistic counting algorithm (1978) analyzed by Flajolet (1985). Specifically, we initialize a counter $C = 1$, and on each increment operation, we increment C with probability $p(C) = b^{-C}$, where base b is a parameter. We estimate the count as $\tilde\tau(C) = \frac{b^C - b}{b - 1}$, which is an unbiased estimator of the true count. We then use learning rates $\eta_{t,i} = \alpha/\sqrt{\tilde\tau_{t,i} + 1}$, which ensures that even when $\tilde\tau_{t,i} = 0$ we don't divide by zero.
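The counter described above is short enough to sketch directly. The class name `MorrisCounter`, the helper `learning_rate`, and the saturation at 255 (so that the state fits in 8 bits) are assumptions of this example; the increment probability $b^{-C}$ and the estimator $(b^C - b)/(b - 1)$ follow the text.

```python
import math
import random


class MorrisCounter:
    """Randomized counter whose state C fits in a single byte."""

    def __init__(self, base=1.1, rng=None):
        self.base = base
        self.c = 1                          # counter starts at C = 1
        self.rng = rng or random.Random()

    def increment(self):
        # Increment C with probability b**-C; saturating at 255 is an
        # assumption made here to respect the 8-bit budget.
        if self.c < 255 and self.rng.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        """Unbiased estimate of the number of increment calls."""
        b = self.base
        return (b ** self.c - b) / (b - 1.0)


def learning_rate(alpha, counter):
    """Per-coordinate rate alpha / sqrt(estimate + 1); never divides by zero."""
    return alpha / math.sqrt(counter.estimate() + 1.0)
```

A fresh counter has estimate 0, so the first learning rate is simply `alpha`. A single counter is noisy, but averaging many independent counters recovers the true count, which is what unbiasedness means here.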

We compute high-probability bounds on this counter in Lemma A.1. Using these bounds for $\eta_{t,i}$ in conjunction with Theorem 3.1, we obtain the following result (proof deferred to the appendix).

Theorem 3.3. Consider running the algorithm of Corollary 3.2 under the assumptions specified there, but using approximate counts $\tilde\tau_i$ in place of the exact counts $\tau_i$. The approximate counts are computed using the randomized counter described above with any base $b > 1$. Thus, $\tilde\tau_{t,i}$ is the estimated number of times $g_{s,i} \neq 0$ on rounds $s = 1, \ldots, t$, and the per-coordinate learning rates are $\eta_{t,i} = \alpha/\sqrt{\tilde\tau_{t,i} + 1}$. With an appropriate choice of $\alpha$ we have

$$\mathbb{E}[\mathrm{Regret}] = o\left(R\sqrt{G^2 + \gamma^2}\; T^{0.5+\delta}\right) \quad \text{for all } \delta > 0,$$

where the o-notation hides a small constant factor and the dependence on the base b.³

³Eq. (5) in the appendix provides a non-asymptotic (but more cumbersome) regret bound.

4. Encoding During Prediction Time

Many real-world problems require large-scale prediction. Achieving scale may require that a trained model be replicated to multiple machines (Bucilua et al., 2006). Saving RAM via rounding is especially attractive here, because unlike in training accumulated roundoff error is no longer an issue. This allows even more aggressive rounding to be used safely.

[Figure 2: two plots, "Results on Public RCV1 Text Classification Data" and "Results on Real-World Ad Click Through Rate Data", horizontal axis "Bits Per Value". Curves: 32-Bit Floats, Global Learning Rate; 32-Bit Floats, Exact Count Per-Coordinate Learning Rates; 32-Bit Floats, Randomized Count Per-Coordinate Learning Rates; Fixed q2.13 Precision, Global Learning Rate; Fixed Precision, Randomized Count Per-Coordinate Learning Rates; Adaptive Precision, Randomized Count Per-Coordinate Learning Rates.]

Figure 2. Rounding at Training Time. The fixed q2.13 encoding is 50% smaller than control with no loss. Per-coordinate learning rates significantly improve predictions but use 64 bits per value. Randomized counting reduces this to 40 bits. Using adaptive or fixed precision reduces memory use further, to 24 total bits per value or less. The benefit of adaptive precision is seen more on the larger CTR data.

Consider rounding a trained model $\beta$ to some $\hat\beta$. We can bound both the additive and relative effect on logistic-loss $\mathcal{L}(\cdot)$ in terms of the quantity $|\beta \cdot x - \hat\beta \cdot x|$:

Lemma 4.1 (Additive Error). Fix $\beta, \hat\beta$ and $(x, y)$. Let $\delta = |\beta \cdot x - \hat\beta \cdot x|$. Then the logistic-loss satisfies

$$\mathcal{L}(x, y; \hat\beta) - \mathcal{L}(x, y; \beta) \leq \delta.$$

Proof. It is well known that $\left| \frac{\partial \mathcal{L}(x, y; \beta)}{\partial \beta_i} \right| \leq 1$ for all $x, y, \beta$ and $i$, which implies the result. □

Lemma 4.2 (Relative Error). Fix $\beta, \hat\beta$ and $(x, y) \in \{0,1\}^d \times \{0,1\}$. Let $\delta = |\beta \cdot x - \hat\beta \cdot x|$. Then

$$\frac{\mathcal{L}(x, y; \hat\beta) - \mathcal{L}(x, y; \beta)}{\mathcal{L}(x, y; \beta)} \leq e^{\delta} - 1.$$

See the appendix for a proof. Now, suppose we are using fixed precision numbers to store our model coefficients such as the Qn.m encoding described earlier, with a precision of $\epsilon$. This induces a grid of feasible model coefficient vectors. If we randomly round each coefficient $\beta_i$ (where $|\beta_i| \leq 2^n$) independently up or down to the nearest feasible value $\hat\beta_i$, such that $\mathbb{E}[\hat\beta_i] = \beta_i$, then for any $x \in \{0,1\}^d$ our predicted log-odds ratio, $\hat\beta \cdot x$, is distributed as a sum of independent random variables $\{\hat\beta_i \mid x_i = 1\}$.

Let $k = \|x\|_0$. In this situation, note that $|\beta \cdot x - \hat\beta \cdot x| \leq \epsilon\|x\|_1 = \epsilon k$, since $|\beta_i - \hat\beta_i| \leq \epsilon$ for all i. Thus Lemma 4.1 implies

$$\mathcal{L}(x, y; \hat\beta) - \mathcal{L}(x, y; \beta) \leq \epsilon\|x\|_1.$$

Similarly, Lemma 4.2 immediately provides an upper bound of $e^{\epsilon k} - 1$ on relative logistic error; this bound is relatively tight for small k, and holds with probability one, but it does not exploit the fact that the randomness is unbiased and that errors should cancel out when k is large. The following theorem gives a bound on expected relative error that is much tighter for large k:

Theorem 4.3. Let $\hat\beta$ be a model obtained from $\beta$ using unbiased randomized rounding to a precision $\epsilon$ grid as described above. Then, the expected logistic-loss relative error of $\hat\beta$ on any input x is at most $2\sqrt{2\pi k}\,\exp(\epsilon^2 k/2)\,\epsilon$ where $k = \|x\|_0$.
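The worst-case bounds of Lemmas 4.1 and 4.2 can be checked numerically on a toy model. The sketch below rounds every coefficient to a grid of spacing $\epsilon$ and compares the change in logistic loss against $\epsilon k$ and $e^{\epsilon k} - 1$; all function names are hypothetical and the random model is made up for illustration only.

```python
import math
import random


def logistic_loss(beta, active, y):
    """Logistic loss for a binary input whose non-zero indices are `active`."""
    z = sum(beta[i] for i in active)
    return math.log1p(math.exp(-z if y == 1 else z))


def round_model(beta, eps, rng):
    """Unbiased randomized rounding of each coefficient to a multiple of eps."""
    rounded = []
    for b in beta:
        a = eps * math.floor(b / eps)
        rounded.append(a + eps if rng.random() < (b - a) / eps else a)
    return rounded


def additive_bound(eps, active):
    """Lemma 4.1 combined with |beta.x - beta_hat.x| <= eps * k."""
    return eps * len(active)


def relative_bound(eps, active):
    """Lemma 4.2 combined with the same bound on delta."""
    return math.exp(eps * len(active)) - 1.0
```

These are the probability-one bounds; Theorem 4.3 is a statement about the expectation and is not checked by this sketch.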

Table 1. Rounding at Prediction Time for CTR Data. Fixed-point encodings are compared to a 32-bit floating point control model. Added loss is negligible even when using only 1.5 bits per value with optimal encoding.

Encoding    AucLoss    Opt. Bits/Val
q2.3        +5.72%     0.1
q2.5        +0.44%     0.5
q2.7        +0.03%     1.5
q2.9        +0.00%     3.3

Additional Compression Figure 1 reveals that coefficient values are not uniformly distributed. Storing these values in a fixed-point representation means that individual values will occur many times. Basic information theory shows that the more common values may be encoded with fewer bits. The theoretical bound for a whole model with d coefficients is $-\frac{1}{d}\sum_{i=1}^{d} \log_2 p(\beta_i)$ bits per value, where $p(v)$ is the probability of occurrence of v in $\beta$ across all dimensions d. Variable length encoding schemes may approach this limit and achieve further RAM savings.
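The information-theoretic limit in the Additional Compression paragraph is the empirical entropy of the rounded coefficient values. A small sketch, with the hypothetical function name `optimal_bits_per_value`, computes it from a list of stored values:

```python
import math
from collections import Counter


def optimal_bits_per_value(values):
    """Empirical entropy in bits: -(1/d) * sum_i log2 p(beta_i)."""
    d = len(values)
    counts = Counter(values)
    # Grouping equal values: each distinct v contributes p(v) * log2 p(v).
    return -sum((c / d) * math.log2(c / d) for c in counts.values())
```

A model whose rounded coefficients are almost all zero has entropy close to zero, which is consistent with the very small optimal sizes reported for coarse encodings.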

5. Experimental Results 

We evaluated on both public and private large data 
sets. We used the public RCV1 text classification
data set, specifically from Chang & Lin (2011). In 
keeping with common practice on this data set, the 
smaller "train" split of 20,242 examples was used for 
parameter tuning and the larger "test" split of 677,399 
examples was used for the full online learning exper- 
iments. We also report results from a private CTR 
data set of roughly 30M examples and 20M features, 
sampled from real ad click data from a major search 
engine. Even larger experiments were run on data sets 
of billions of examples and billions of dimensions, with 
similar results as those reported here. 

The evaluation metrics for predictions are error rate
for the RCV1 data, and AucLoss (or 1-AUC) relative
to a control model for the CTR data. Lower values
are better. Metrics are computed using progressive
validation (Blum et al., 1999) as is standard for online
learning: on each round a prediction is made for a
given example and recorded for evaluation, and only after
that is the model allowed to train on the example. We
also report the number of bits per coordinate used.
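Progressive validation amounts to a predict-then-train loop. In the sketch below, `RunningMeanModel` is a hypothetical stand-in for any online learner that exposes `predict` and `update`; it is not one of the models evaluated in the paper.

```python
class RunningMeanModel:
    """Toy online model: predicts the fraction of positive labels seen so far."""

    def __init__(self):
        self.positives = 0
        self.seen = 0

    def predict(self, x):
        return self.positives / self.seen if self.seen else 0.5

    def update(self, x, y):
        self.positives += y
        self.seen += 1


def progressive_error_rate(model, examples):
    """Score each example before the model is allowed to train on it."""
    mistakes = 0
    total = 0
    for x, y in examples:
        p = model.predict(x)                      # prediction is recorded first
        mistakes += int((p >= 0.5) != (y == 1))
        total += 1
        model.update(x, y)                        # training happens afterwards
    return mistakes / total if total else 0.0
```

Because each label is used for scoring before it is used for training, every example acts as held-out data at the moment it is evaluated.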

Rounding During Training Our main results are given in Figure 2. The comparison baseline is online logistic regression using a single global learning rate and 32-bit floats to store coefficients. We also test the effect of per-coordinate learning rates with both 32-bit integers for exact counts and with 8-bit randomized counts. We test the range of tradeoffs available for fixed-precision rounding with randomized counts, varying the number of precision bits m in q2.m encoding to plot the tradeoff curve (cyan). We also test the range of tradeoffs available for adaptive-precision rounding with randomized counts, varying the precision scalar $\gamma$ to plot the tradeoff curve (dark red). For all randomized counts a base of 1.1 was used. Other than these differences, the algorithms tested are identical.

Using a single global learning rate, a fixed q2.13 encoding
saves 50% of the RAM at no added loss compared
to the baseline. The addition of per-coordinate
learning rates gives significant improvement in predic- 
tive performance, but at the price of added memory 
consumption, increasing from 32 bits per coordinate to 
64 bits per coordinate in the baselines. Using random- 
ized counts reduces this down to 40 bits per coordi- 
nate. However, both the fixed-precision and the adap- 
tive precision methods give far better results, achiev- 
ing the same excellent predictive performance as the 
64-bit method with 24 bits per coefficient or less. This 
saves 62.5% of the RAM cost compared to the 64-bit 
method, and is still smaller than using 32-bit floats 
with a global learning rate. 
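The memory figures quoted in this paragraph follow from adding the coefficient width to the counter width. The short calculation below reproduces them; the configuration labels are made up for this example.

```python
def bits_per_coordinate(coefficient_bits, counter_bits=0):
    """Total storage per coordinate: coefficient plus optional count."""
    return coefficient_bits + counter_bits


def savings(new_bits, old_bits):
    """Fraction of memory saved when moving from old_bits to new_bits."""
    return 1.0 - new_bits / old_bits


CONFIGS = {
    "float32, global rate": bits_per_coordinate(32),
    "float32, exact 32-bit count": bits_per_coordinate(32, 32),
    "float32, 8-bit randomized count": bits_per_coordinate(32, 8),
    "q2.13, 8-bit randomized count": bits_per_coordinate(16, 8),
}
```

Here `savings(24, 64)` evaluates to 0.625, the 62.5% quoted above, and the 24-bit configuration is also below the 32 bits of the global-rate baseline.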

The benefit of adaptive precision is only apparent on 
the larger CTR data set, which has a "long tail" distri- 
bution of support across features. However, it is useful 
to note that the simpler fixed-precision method also 
gives great benefit. For example, using q2.13 encod- 
ing for coefficient values and 8-bit randomized counters 
allows full-byte alignment in naive data structures. 



Rounding at Prediction Time We tested the ef- 
fect of performing coarser randomized rounding of a 
fully-trained model on the CTR data, and compared to 
the loss incurred using a 32-bit floating point represen- 
tation. These results, given in Table 1, clearly support 
the theoretical analysis that suggests more aggressive 
rounding is possible at prediction time. Surprisingly 
coarse levels of precision give excellent results, with 
little or no loss in predictive performance. The mem- 
ory savings achievable in this scheme are considerable, 
down to less than two bits per value for q2.7 with theoretically
optimal encoding of the discrete values.

6. Conclusions 

Randomized storage of coefficient values provides an
efficient method for achieving significant RAM savings
both during training and at prediction time.

While in this work we focus on OGD, similar randomized
rounding schemes may be applied to other
learning algorithms. The extension to algorithms that
efficiently handle $L_1$ regularization, like RDA (Xiao,
2009) and FTRL-Proximal (McMahan, 2011), is relatively
straightforward.⁴ Large scale kernel machines,
matrix decompositions, topic models, and other large-scale
learning methods may all be modifiable to take
advantage of RAM savings through low precision randomized
rounding methods.
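⁴Some care must be taken to store a discretized version of a scaled gradient sum, so that the dynamic range remains roughly unchanged as learning progresses.

A. Appendix: Proofs

A.1. Proof of Theorem 3.1

Our analysis extends the technique of Zinkevich (2003). Let $\beta^*$ be any feasible point (with possibly infinite precision coefficients). By the definition of $\beta_{t+1}$,

$$\|\beta_{t+1} - \beta^*\|^2 \leq \|\hat\beta_t - \beta^*\|^2 - 2\eta_t g_t \cdot (\hat\beta_t - \beta^*) + \eta_t^2 \|g_t\|^2.$$

Rearranging the above yields

$$g_t \cdot (\hat\beta_t - \beta^*) \leq \frac{1}{2\eta_t}\left(\|\hat\beta_t - \beta^*\|^2 - \|\beta_{t+1} - \beta^*\|^2\right) + \frac{\eta_t}{2}\|g_t\|^2$$
$$= \frac{1}{2\eta_t}\left(\|\hat\beta_t - \beta^*\|^2 - \|\hat\beta_{t+1} - \beta^*\|^2\right) + \frac{\eta_t}{2}\|g_t\|^2 + \rho_t,$$

where the $\rho_t = \frac{1}{2\eta_t}\left(\|\hat\beta_{t+1} - \beta^*\|^2 - \|\beta_{t+1} - \beta^*\|^2\right)$ terms will capture the extra regret due to the randomized rounding. Summing over t, and following Zinkevich's analysis, we obtain a bound of

$$\mathrm{Regret}(T) \leq \frac{(2R)^2}{2\eta_T} + \frac{G^2}{2}\eta_{1:T} + \rho_{1:T}.$$

It remains to bound $\rho_{1:T}$. Letting $d_t = \beta_{t+1} - \hat\beta_{t+1}$ and $a_t = d_t/\eta_t$, we have

$$\rho_{1:T} = \sum_{t=1}^{T} \frac{1}{2\eta_t}\left((\hat\beta_{t+1} - \beta^*)^2 - (\beta_{t+1} - \beta^*)^2\right)$$
$$= \sum_{t=1}^{T} \frac{1}{2\eta_t}\left(\hat\beta_{t+1}^2 - \beta_{t+1}^2\right) + \beta^* a_{1:T}$$
$$\leq \sum_{t=1}^{T} \frac{1}{2\eta_t}\left(\hat\beta_{t+1}^2 - \beta_{t+1}^2\right) + R\,|a_{1:T}|.$$

We bound each of the terms in this last expression in expectation. First, note $|d_t| \leq \epsilon_t \leq \gamma\eta_t$ by definition of the resolution of the rounding grid, and so $|a_t| \leq \gamma$. Further $\mathbb{E}[d_t] = 0$ since the rounding is unbiased. Letting $W = |a_{1:T}|$, by Jensen's inequality we have $\mathbb{E}[W]^2 \leq \mathbb{E}[W^2]$. Thus, $\mathbb{E}[|a_{1:T}|] \leq \sqrt{\mathbb{E}[(a_{1:T})^2]} = \sqrt{\mathrm{Var}(a_{1:T})}$, where the last equality follows from the fact $\mathbb{E}[a_{1:T}] = 0$. The $a_t$ are not independent given an adaptive adversary.⁵ Nevertheless, consider any $a_s$ and $a_t$ with $s < t$. Since both have expectation zero, $\mathrm{Cov}(a_s, a_t) = \mathbb{E}[a_s a_t]$. By construction, $\mathbb{E}[a_t \mid g_t, \hat\beta_t, \mathrm{hist}_t] = 0$, where $\mathrm{hist}_t$ is the full history of the game up until round t, which includes $a_s$ in particular. Thus

$$\mathrm{Cov}(a_s, a_t) = \mathbb{E}[a_s a_t] = \mathbb{E}\left[\mathbb{E}[a_s a_t \mid g_t, \mathrm{hist}_t]\right] = 0.$$

For all t, $|a_t| \leq \gamma$ so $\mathrm{Var}(a_t) \leq \gamma^2$, and $\mathrm{Var}(a_{1:T}) = \sum_t \mathrm{Var}(a_t) \leq \gamma^2 T$. Thus, $\mathbb{E}[|a_{1:T}|] \leq \gamma\sqrt{T}$.

Next, consider $\mathbb{E}[\hat\beta_{t+1}^2 - \beta_{t+1}^2 \mid \beta_{t+1}]$. Since $\mathbb{E}[\hat\beta_{t+1} \mid \beta_{t+1}] = \beta_{t+1}$, for any shift $s \in \mathbb{R}$, we have $\mathbb{E}[(\hat\beta_{t+1} - s)^2 - (\beta_{t+1} - s)^2 \mid \beta_{t+1}] = \mathbb{E}[\hat\beta_{t+1}^2 - \beta_{t+1}^2 \mid \beta_{t+1}]$, and so taking $s = \beta_{t+1}$,

$$\frac{1}{\eta_t}\mathbb{E}[\hat\beta_{t+1}^2 - \beta_{t+1}^2 \mid \beta_{t+1}] = \frac{1}{\eta_t}\mathbb{E}[(\hat\beta_{t+1} - \beta_{t+1})^2 \mid \beta_{t+1}] \leq \frac{\epsilon_t^2}{\eta_t} \leq \frac{\gamma^2\eta_t^2}{\eta_t} = \gamma^2\eta_t.$$

Combining this result with $\mathbb{E}[|a_{1:T}|] \leq \gamma\sqrt{T}$, we have

$$\mathbb{E}[\rho_{1:T}] \leq \frac{\gamma^2}{2}\eta_{1:T} + \gamma R\sqrt{T},$$

which completes the proof. □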

⁵For example the adversary could ensure $a_{t+1} = 0$ (by playing $g_{t+1} = 0$) iff $a_t > 0$.

A.2. Approximate Counting

We first provide high-probability bounds for the approximate counter.

Lemma A.1. Fix T and $t \leq T$. Let $C_{t+1}$ be the value of the counter after t increment operations using the approximate counting algorithm described in Section 3.2 with base $b > 1$. Then, for all $c > 0$, the estimated count $\tilde\tau(C_{t+1})$ satisfies

$$\Pr\left[\tilde\tau(C_{t+1}) < \frac{t}{bc\log(T)} - 1\right] \leq \frac{1}{T^{c-1}} \qquad (1)$$

and

$$\Pr\left[\tilde\tau(C_{t+1}) > \frac{et}{b-1}\, b^{\sqrt{2c\log_b(T)}+2}\right] \leq \frac{1}{T^{c}}. \qquad (2)$$

Both T and c are essentially parameters of the bound; in Eq. (2), any choices of T and c that keep $T^c$ constant produce the same bound. In the first bound, the result is sharpest when $T = t$, but it will be convenient to set T equal to the total number of rounds so that we can easily take a union bound (in the proof of Theorem 3.3).



Proof of Lemma A.1. Fix a sequence of T increments, and let $C_i$ denote the value of the approximate counter at the start of increment number i, so $C_1 = 1$. Let $X_j = |\{i : C_i = j\}|$, a random variable for the number of increments for which the counter stayed at j.

We start with the bound of Eq. (1). When $C = j$, the update probability is $p_j = p(j) = b^{-j}$, so for any $\ell_j$ we have $X_j > \ell_j$ with probability at most $(1 - p_j)^{\ell_j} \leq \exp(-p_j)^{\ell_j} = \exp(-p_j\ell_j)$ since $(1 - x) \leq \exp(-x)$ for all x. To make this at most $T^{-c}$ it suffices to take $\ell_j = c(\log T)/p_j = cb^j\log T$. Taking a (rather loose) union bound over $j = 1, 2, \ldots, T$, we have

$$\Pr[\exists j,\ X_j > cb^j\log T] \leq 1/T^{c-1}.$$

For Eq. (1), it suffices to show that if this does not occur, then $\tilde\tau(C_t) \geq t/(bc\log(T)) - 1$. Note $\sum_{j=1}^{C_t} X_j \geq t$. With our supposition that $X_j \leq cb^j\log T$ for all j, this implies $t \leq \sum_{j=1}^{C_t} cb^j\log T = cb\log T\left(\frac{b^{C_t} - 1}{b - 1}\right)$, and thus $C_t \geq \log_b\left(\frac{t(b-1)}{bc\log T} + 1\right)$. Since $\tilde\tau$ is monotonically increasing and $b > 1$, simple algebra then shows $\tilde\tau(C_{t+1}) \geq \tilde\tau(C_t) \geq t/(bc\log(T)) - 1$.

Next consider the bound of Eq. (2). Let $j_0$ be the minimum value such that $p(j_0) \leq 1/(et)$, and fix $k \geq 0$. Then $C_{t+1} \geq j_0 + k$ implies the counter was incremented k times with an increment probability at most $p(j_0)$. Thus,

$$\Pr[C_{t+1} \geq j_0 + k] \leq \binom{t}{k}\prod_{j=j_0}^{j_0+k-1} p(j)$$
$$\leq \left(\frac{te}{k}\right)^k \left(\prod_{j=0}^{k-1} p(j_0)\, b^{-j}\right)$$
$$= \left(\frac{te}{k}\right)^k p(j_0)^k\, b^{-k(k-1)/2}$$
$$\leq k^{-k} \cdot b^{-k(k-1)/2}.$$

Note that $j_0 \leq \lceil\log_b(et)\rceil$. Taking $k = \sqrt{2c\log_b(T)} + 1$ is sufficient to ensure this probability is at most $T^{-c}$, since $k^{-k} \leq 1$ and $k^2 - k \geq 2c\log_b T$. Observing that

$$\tilde\tau\left(\lceil\log_b(et)\rceil + \sqrt{2c\log_b(T)} + 1\right) \leq \frac{et}{b-1}\, b^{\sqrt{2c\log_b(T)}+2}$$

completes the proof. □



Proof of Theorem 3.3. We prove the bound for the one-dimensional case; the general bound then follows by summing over dimensions. Since we consider a single dimension, we assume $|g_t| > 0$ on all rounds. This is without loss of generality, because we can implicitly skip all rounds with zero gradients, which means we don't need to make the distinction between $t$ and $\tau_t$. We abuse notation slightly by defining $\tilde\tau_t \equiv \tilde\tau(C_{t+1}) \approx t = \tau_t$ for the approximate count on round $t$. We begin from the bound

$$\mathbb{E}[\text{Regret}] \le \frac{(2R)^2}{2\eta_T} + \frac{1}{2}(G^2 + \gamma^2)\,\eta_{1:T} + \gamma R\sqrt{T}$$



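of Theorem 3.1, with learning rates $\eta_t = \alpha/\sqrt{\tilde\tau_t + 1}$. Lemma A.1 with $c = 2.5$ then implies

$$\Pr[\tilde\tau_t + 1 < k_1 t] \le \frac{1}{T^{1.5}} \quad\text{and}\quad \Pr[\tilde\tau_t > k_2 t] \le \frac{1}{T^{2.5}},$$

where $k_1 = 1/(bc\log T)$ and $k_2 = \frac{e\, b^{\sqrt{2c\log_b(T)}+2}}{b-1}$. A union bound on $t = 1, \dots, T$ on the first bound implies that with probability at least $1 - 1/\sqrt{T}$ we have $\forall t,\ \tilde\tau_t + 1 \ge k_1 t$, so

$$\eta_{1:T} = \sum_{t=1}^{T} \frac{\alpha}{\sqrt{\tilde\tau_t + 1}} \le \frac{\alpha}{\sqrt{k_1}} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le \frac{2\alpha\sqrt{T}}{\sqrt{k_1}}, \tag{3}$$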



where we have used the inequality $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T}$. Similarly, the second inequality implies with probability at least $1 - 1/T^{2.5}$,

$$\eta_T = \frac{\alpha}{\sqrt{\tilde\tau_T + 1}} \ge \frac{\alpha}{\sqrt{k_2 T + 1}}. \tag{4}$$



Taking a union bound, Eqs. (3) and (4) hold with probability at least $1 - 2/\sqrt{T}$, and so at least one fails with probability at most $2/\sqrt{T}$. Since $f_t(\beta) - f_t(\beta') \le 2GR$ for any $\beta, \beta' \in [-R, R]$ (using the convexity of $f_t$ and the bound on the gradients $G$), on any run of the algorithm, regret is bounded by $2RGT$. Thus, these failed cases contribute at most $4RG\sqrt{T}$ to the expected regret bound.

Now suppose Eqs. (3) and (4) hold. Choosing $\alpha = R/\sqrt{G^2 + \gamma^2}$ minimizes the dependence on the other constants, and note for any $\delta > 0$, both $1/\sqrt{k_1}$ and $\sqrt{k_2}$ are $o(T^{\delta})$. Thus, when Eqs. (3) and (4) hold,



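$$\begin{aligned}
\mathbb{E}[\text{Regret}] &\le \frac{(2R)^2}{2\eta_T} + \frac{1}{2}(G^2 + \gamma^2)\,\eta_{1:T} + \gamma R\sqrt{T} \\
&\le \frac{2R^2\sqrt{k_2 T + 1}}{\alpha} + (G^2 + \gamma^2)\,\alpha\sqrt{\frac{T}{k_1}} + \gamma R\sqrt{T}.
\end{aligned}$$

Adding $4RG\sqrt{T}$ for the case when the high-probability statements fail still leaves the same bound. □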

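It follows from the proof that we have the more precise but cumbersome upper bound on $\mathbb{E}[\text{Regret}]$:

$$\frac{2R^2\sqrt{k_2 T + 1}}{\alpha} + (G^2 + \gamma^2)\,\alpha\sqrt{\frac{T}{k_1}} + \gamma R\sqrt{T} + 4RG\sqrt{T}. \tag{5}$$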
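The claim that $1/\sqrt{k_1}$ and $\sqrt{k_2}$ are $o(T^{\delta})$ is asymptotic, and for moderate $T$ these factors are not small. The sketch below tabulates both factors relative to $T^{\delta}$, using $k_1 = 1/(bc\log T)$ and $k_2 = e\,b^{\sqrt{2c\log_b T}+2}/(b-1)$ with $c = 2.5$ as in the proof. The function names and the sample values of $b$, $\delta$, and $T$ are illustrative assumptions.

```python
import math


def k1(T: float, b: float, c: float = 2.5) -> float:
    """k_1 = 1 / (b c log T)."""
    return 1.0 / (b * c * math.log(T))


def k2(T: float, b: float, c: float = 2.5) -> float:
    """k_2 = e * b^(sqrt(2 c log_b T) + 2) / (b - 1)."""
    exponent = math.sqrt(2.0 * c * math.log(T, b)) + 2.0
    return math.e * b ** exponent / (b - 1.0)


def ratios(T: float, b: float, delta: float):
    """Return (1/sqrt(k_1)) / T^delta and sqrt(k_2) / T^delta."""
    scale = T ** delta
    return (1.0 / math.sqrt(k1(T, b))) / scale, math.sqrt(k2(T, b)) / scale


if __name__ == "__main__":
    for T in (1e3, 1e6, 1e9):
        r1, r2 = ratios(T, b=1.1, delta=0.25)
        print(f"T={T:.0e}  1/sqrt(k1)/T^d={r1:.3f}  sqrt(k2)/T^d={r2:.3f}")
```

With these sample values both ratios shrink as $T$ grows, in line with the $o(T^{\delta})$ statement, although the decay is slow for small $\delta$.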



A.3. Encoding During Prediction Time

We use the following well-known inequality, which is a direct corollary of the Azuma-Hoeffding inequality. For a proof, see (Chung & Lu, 2006).

Theorem A.2. Let $X_1, \dots, X_d$ be independent random variables such that for each $i$, there is a constant $c_i$ such that $|X_i - \mathbb{E}[X_i]| \le c_i$ always. Let $X = \sum_{i=1}^{d} X_i$. Then $\Pr[|X - \mathbb{E}[X]| \ge t] \le 2\exp\left(-t^2 / \left(2\sum_i c_i^2\right)\right)$.

An immediate consequence is the following large deviation bound on $\delta = |\hat\beta \cdot x - \beta \cdot x|$:

Lemma A.3. Let $\hat\beta$ be a model obtained from $\beta$ using unbiased randomized rounding to a precision $\epsilon$ grid. Fix $x$, and let $Z = \hat\beta \cdot x$ be the random predicted log-odds ratio. Then

$$\Pr[|Z - \beta \cdot x| \ge t] \le 2\exp\left(-\frac{t^2}{2\epsilon^2 \|x\|_0}\right).$$
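As an illustration of Lemma A.3, the sketch below rounds each coordinate of a weight vector to an $\epsilon$ grid with unbiased randomized rounding and compares the empirical tail of $|Z - \beta \cdot x|$ with the stated bound. It assumes binary features, so that each rounded coordinate contributes an error of at most $\epsilon$. The function names and test values are hypothetical and not taken from the paper.

```python
import math
import random


def round_to_grid(value: float, eps: float, rng: random.Random) -> float:
    """Unbiased randomized rounding of value to an integer multiple of eps."""
    lower = math.floor(value / eps)
    frac = value / eps - lower  # position inside the grid cell, in [0, 1)
    return eps * (lower + (1 if rng.random() < frac else 0))


def tail_bound(t: float, eps: float, nnz: int) -> float:
    """Right-hand side of Lemma A.3: 2 exp(-t^2 / (2 eps^2 ||x||_0))."""
    return 2.0 * math.exp(-t * t / (2.0 * eps * eps * nnz))


def empirical_tail(beta, x, eps, t, trials, seed=0):
    """Fraction of independent roundings with |beta_hat . x - beta . x| >= t."""
    rng = random.Random(seed)
    exact = sum(b * xi for b, xi in zip(beta, x))
    hits = 0
    for _ in range(trials):
        z = sum(round_to_grid(b, eps, rng) * xi
                for b, xi in zip(beta, x) if xi != 0)
        hits += abs(z - exact) >= t
    return hits / trials
```

Only coordinates with nonzero features are rounded inside `empirical_tail`, which is why the bound depends on $\|x\|_0$ rather than on the full dimension.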



Lemmas 4.1 and 4.2 provide bounds in terms of the quantity $|\beta \cdot x - \hat\beta \cdot x|$. The former is proved in Section 4; we now provide a proof of the latter.

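Proof of Lemma 4.2. We claim that the relative error is bounded as

$$\frac{\mathcal{L}(x, y; \hat\beta) - \mathcal{L}(x, y; \beta)}{\mathcal{L}(x, y; \beta)} \le e^{\delta} - 1, \tag{6}$$

or equivalently, that $\mathcal{L}(x, y; \hat\beta) \le e^{\delta} \mathcal{L}(x, y; \beta)$, where $\delta = |\hat\beta \cdot x - \beta \cdot x|$ as before. We will argue the case in which $y = 1$; the $y = 0$ case is analogous. Let $z = \beta \cdot x$ and $\hat z = \hat\beta \cdot x$; then, when $y = 1$,

$$\mathcal{L}(x, y; \beta) = \log(1 + \exp(-z)),$$

and similarly for $\hat\beta$ and $\hat z$. If $\hat z > z$ then $\mathcal{L}(x, y; \hat\beta)$ is less than $\mathcal{L}(x, y; \beta)$, which immediately implies the claim. Thus, we need only consider the case when $\hat z = z - \delta$. Then, the claim of Eq. (6) is equivalent to

$$\log(1 + \exp(-z + \delta)) \le \exp(\delta) \log(1 + \exp(-z)),$$

or equivalently,

$$1 + \exp(-z + \delta) \le (1 + \exp(-z))^{\exp(\delta)}.$$

Let $w = \exp(\delta)$ and $u = \exp(-z)$. Then, we can rewrite the last line as $1 + wu \le (1 + u)^{w}$, which is true by Bernoulli's inequality, since $u \ge 0$ and $w \ge 1$. □

Proof of Theorem 4.3. Let $R = \frac{\mathcal{L}(x, y; \hat\beta) - \mathcal{L}(x, y; \beta)}{\mathcal{L}(x, y; \beta)}$ denote the relative error due to rounding, and let $R(\delta)$ be the worst case expected relative error given $\delta = |\hat\beta \cdot x - \beta \cdot x|$. Let $\bar R = e^{\delta} - 1$. Then, by Lemma 4.2, $R(\delta) \le \bar R(\delta)$. It is sufficient to prove a suitable upper bound on $\mathbb{E}[\bar R]$. First, for $r \ge 0$,

$$\Pr[\bar R \ge r] = \Pr[e^{\delta} - 1 \ge r] = \Pr[\delta \ge \log(r + 1)].$$

Using this together with Lemma A.3, we bound the expectation of $\bar R$ as follows:

$$\begin{aligned}
\mathbb{E}[\bar R] &= \int_{r=0}^{\infty} \Pr[\bar R \ge r]\, dr \\
&\le 2\int_{r=0}^{\infty} \exp\left(-\frac{\log^2(r + 1)}{2\epsilon^2 \|x\|_0}\right) dr,
\end{aligned}$$

and since the function being integrated is non-negative on $(-1, \infty)$,

$$\begin{aligned}
&\le 2\int_{r=-1}^{\infty} \exp\left(-\frac{\log^2(r + 1)}{2\epsilon^2 \|x\|_0}\right) dr \\
&= 2\sqrt{2\pi \|x\|_0}\, \exp\left(\frac{\epsilon^2 \|x\|_0}{2}\right) \epsilon,
\end{aligned}$$

where the last line follows after straightforward calculus. A slightly tighter bound (replacing the leading 2 with $1 + \mathrm{Erf}(\epsilon\sqrt{\|x\|_0}/\sqrt{2})$) can be obtained if one does not make the change in the lower limit of integration. □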
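The closed form in the proof of Theorem 4.3 can be sanity-checked numerically. The sketch below substitutes $u = \log(r + 1)$, which turns the integrand into $\exp(-u^2/(2s^2) + u)$ with $s = \epsilon\sqrt{\|x\|_0}$, and integrates from $u = 0$ with a plain trapezoid rule. The result is compared with both the bound $2\sqrt{2\pi\|x\|_0}\exp(\epsilon^2\|x\|_0/2)\,\epsilon$ and the tighter variant that uses $1 + \mathrm{Erf}(\epsilon\sqrt{\|x\|_0}/\sqrt{2})$ as the leading factor. The function names, the truncation point, and the step count are assumptions made for this illustration.

```python
import math


def closed_form(eps: float, nnz: int) -> float:
    """2 sqrt(2 pi ||x||_0) exp(eps^2 ||x||_0 / 2) eps."""
    s = eps * math.sqrt(nnz)
    return 2.0 * math.sqrt(2.0 * math.pi) * s * math.exp(s * s / 2.0)


def tighter_form(eps: float, nnz: int) -> float:
    """Same expression with the leading 2 replaced by 1 + erf(s / sqrt(2))."""
    s = eps * math.sqrt(nnz)
    lead = 1.0 + math.erf(s / math.sqrt(2.0))
    return lead * math.sqrt(2.0 * math.pi) * s * math.exp(s * s / 2.0)


def numeric_integral(eps: float, nnz: int, steps: int = 100_000) -> float:
    """Trapezoid rule for 2 * int_0^inf exp(-log^2(r+1) / (2 s^2)) dr.

    After u = log(r + 1) the integrand is exp(-u^2 / (2 s^2) + u); the range
    is truncated at s^2 + 12 s, beyond which the integrand is negligible.
    """
    s = eps * math.sqrt(nnz)
    upper = s * s + 12.0 * s
    h = upper / steps

    def f(u: float) -> float:
        return math.exp(-u * u / (2.0 * s * s) + u)

    total = 0.5 * (f(0.0) + f(upper)) + sum(f(i * h) for i in range(1, steps))
    return 2.0 * h * total
```

For instance, with `eps=0.01` and `nnz=100` the numeric value should agree closely with `tighter_form` and stay below `closed_form`, as the proof indicates.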